This question is about the detection and quantification of viral loads in actual COVID-19 tests.
The data you will work with is from the very first experiment we performed as part of a pilot study in the spring of 2020 to test the sensitivity of an assay system that we developed at NYUAD using the Fluidigm microfluidic platform. We showed that our assay is quantitative and that we could detect very low titers of virus that were previously classified as negative samples in the diagnostic lab. The results of this study have been published:
Xie et al., Processes 2020 (PDF) - Supplement (PDF)
This dataset contains results of qRT-PCR assays for SARS-CoV-2 nasopharyngeal swabs from a clinical laboratory in Abu Dhabi.
The samples were previously classified as “Negative” or “Positive” for presence of SARS-CoV-2 viral particles using a standard CDC-approved diagnostic protocol in a clinical laboratory (“PHD Diagnostics”). These protocols typically involve an RNA extraction step followed by reverse transcription and qPCR amplification. Samples were called “Positive” by PHD if one or more of the viral genes could be detected by PCR amplification within 32 PCR cycles.
The samples were then tested using the NYUAD assay system, which includes an RNA extraction step followed by pre-amplification using standard RT-PCR, and then additional cycles of qPCR using the Fluidigm microfluidic assay system. The two-step amplification contributes to the sensitivity of our assay. In addition, very small volumes of reagents are used in the microfluidic system, which both enables numerous replicate assays and lowers the cost of reagents.
- RX: 1 negative control for the RNA extraction (extraction performed using buffer only)
- NT: 1 negative control for RT-PCR (reactions run using amplification mix only)
- CoV: a dilution series of 3 positive controls containing plasmid DNA encoding the N gene, at 3 different concentrations, for quantification of viral load:
Note that in order to detect viral RNA, we used two different probe sets for the SARS-CoV-2 N gene, N1 and N2. We expect the N1 and N2 assays to be positive for the dilution series controls and for clinical samples that contain viral particles, and negative for the other controls.
The human RP gene is a control for the quality of the clinical samples and should be present in all clinical samples if the swab was ok. We expect the RP assays to be positive for all samples and negative for all controls.
The layout for the Fluidigm chip is illustrated below:
1) Fluidigm output
The raw data file from the Fluidigm system has a lot of extra metadata and some of the column data are not that well labeled. To provide an easier starting point for this analysis, the file has been massaged slightly and supplemented with additional labels. Otherwise the file is pretty much the way it comes off the machine.
The columns are as follows:
- ID: the chamber/well ID (e.g. S138-A01)
- SampleName:
  - RX QC: Quality control for RNA extraction
  - NT QC: Quality control for reverse transcription and preamplification
  - CoV 50: 50 copies / ul
  - CoV 500: 500 copies / ul
  - CoV 5000: 5000 copies / ul
  - clinical sample barcodes (e.g. 9900039147 or 04MI200640957A)
- Unknown (you can ignore this)
- AssayName:
  - N1: SARS-CoV-2 N gene, probe set 1 (9 replicates)
  - N2: SARS-CoV-2 N gene, probe set 2 (9 replicates)
  - RP: Human RP gene (6 replicates)
- Call: Pass (good assay) or Flag (bad assay or Value = 999)
- Pass (you can ignore this)
- Plate: Plate1 or Plate2

2) PHD diagnostic results
The diagnostic results contain two columns for each sample tested: the sample barcode and the diagnostic call (Negative or Positive).
We would like to answer two basic questions:
1. Is our assay more sensitive than the one used by the clinical diagnostic lab?
2. Can we correlate the viral load we measure in our lab with the status of samples according to the clinical diagnosis?
To do this, we will use linear regression and compare the status of each sample based on our own assay vs. the status based on the clinical lab.
We will need to do some quality control on the data before we start, however, to get rid of data that are not reliable (e.g. due to pipetting errors, contamination, etc.)
Below you will organize the data to make it easier for you to work with it. Feel free to delete any columns you don’t need to use, or to add columns that you think will help you keep track of everything.
Read in the Fluidigm data file and take a look at it.
# Import massaged Fluidigm data file
fluid.all = read.csv("data/Fluidigm_1691207281.csv", stringsAsFactors = TRUE)
head(fluid.all)
Note: To simplify your life, you may wish to remove columns that you are not interested in anymore. You should keep the sample ID and/or SampleName, AssayName, Value, Call, and Plate columns. You will also need to keep rConc for the Standards.
# ============================================================================ #
# negative controls ("RX QC" and "NT QC")
neg.ctls = fluid.all %>% filter(SampleName %in% c("RX QC", "NT QC")) %>%
select(ID,SampleName,AssayName,Value,Call,Plate) %>%
arrange(Plate,ID)
# ============================================================================ #
# standards ("CoV 50", "CoV 500", "CoV 5000")
std.ctls = fluid.all %>% filter(rConc != 1) %>% # shortcut: only the standards have rConc != 1
select(ID,SampleName,rConc,AssayName,Value,Call,Plate) %>%
arrange(Plate,ID)
It’s always a good idea to check whether your controls look ok. Use whatever method you like to inspect the data and answer the following questions.
(You don’t need to plot the data to answer these questions, but you certainly can if you want…)
# take a look at the data ...
# View(neg.ctls)
# View(std.ctls)
# this is not really a very good way to look at the data ... but all should be 999
ggplot(neg.ctls, aes(x=SampleName, y=Value, fill=SampleName)) +
geom_violin() +
geom_boxplot(width=0.05, fill="white") +
facet_grid(Plate ~ AssayName)
ggplot(std.ctls, aes(x=SampleName, y=Value, fill=SampleName)) +
geom_violin() +
geom_boxplot(width=0.05, fill="white") +
facet_grid(Plate ~ AssayName) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
# Your answer here
No -- there should be no N1, N2, or RP detected for the QC wells, so
we would expect to see Value = 999 and Call = "Flag".
However, some of them amplify something on Plate 2.
This is probably due to cross-contamination or errors during plate loading.
RX controls (RNA extraction):
all three probes show some signal, so it's possible that one of the
positive clinical samples got loaded in the wrong place (prior to RNA extraction).
NT controls (RT and pre-amplification):
N1 and N2 are negative, but RP amplified -- possible contamination
from a negative lab sample, or one of the lab people?
# Your answer here
No -- there should be no RP detected, but some of the RP also amplify on Plate 2.
Again, this reflects some kind of contamination somewhere along the line.
# ============================================================================ #
# samples
# subset fluid.all
samples = fluid.all %>%
filter(! SampleName %in% c("RX QC", "NT QC", "CoV 50", "CoV 500", "CoV 5000")) %>%
select(ID,SampleName,rConc,AssayName,Value,Call,Plate) %>%
arrange(Plate,ID)
# Import PHD results
phd.calls = read.csv("data/PHD_results.csv", stringsAsFactors = FALSE)
# Add a column to the Fluidigm file containing the PHD results
# use the inner_join() function to do this
samples = inner_join(samples, phd.calls)  # joins on the shared SampleName column
## Joining, by = "SampleName"
# clean up
rm(phd.calls)
As we go along we will classify the samples as “Inconclusive”, “Negative”, or “Positive”, according to the NYUAD results. So let’s add a placeholder column that we will fill up as we go along.
Add a column to samples called “NYUAD_Class” and fill it with NA.
# Add a column to samples for NYUAD classes and rename Class to PHD_Class
samples = samples %>% rename(PHD_Class = Class) %>% mutate(NYUAD_Class = NA)
In order to quantify viral load, we want to make sure that our standards look good. First, let’s plot the standards for both plates to see how they look.
Use ggplot to plot the data with a simple linear regression line of Ct against the concentration.
Map color to AssayName and facet the plot by Plate to visualize the results for N1 and for N2 separately on each plate.
Hint: you will need to do the regression using a transformation of the data. Since you have a 10-fold dilution series (50, 500, 5000 copies/ul), you can use a log transformation with base 10 (log10()).
# subset N1 and N2 assays from standard control set
std.ctls.n1.n2 = std.ctls %>% filter(AssayName %in% c("N1","N2"))
# side-by-side graphs for N1 and N2 on Plate 1 and Plate 2
ggplot(std.ctls.n1.n2, aes(x=log10(rConc), y=Value, color=AssayName)) +
geom_point() +
geom_smooth(method="lm") +
ggtitle("SARS-CoV-2 dilution series by plate (N1 & N2 probe sets)") +
xlab("log10(viral copies/ul)") + ylab("Ct") +
facet_wrap(~Plate)
## `geom_smooth()` using formula 'y ~ x'
## NOTE: If you want to show the original concentrations instead of the
# log10 concentrations on the graph, you can use this function inside
# the ggplot command
# scale_x_continuous(label = function(x){return(round(10^x,1))})
# with original concentrations
ggplot(std.ctls.n1.n2, aes(x=log10(rConc), y=Value, color=AssayName)) +
geom_point() +
geom_smooth(method="lm") +
ggtitle("SARS-CoV-2 dilution series by plate (N1 & N2 probe sets)") +
scale_x_continuous(label = function(x){return(round(10^x,1))}) +
xlab("Viral copies/ul") + ylab("Ct") +
facet_wrap(~Plate)
## `geom_smooth()` using formula 'y ~ x'
# Your answer here
No -- the N1 and N2 assays look similar to each other, but the regression lines
from the two plates are very different.
Something looks fishy here: it seems that something was not quite right when
this experiment was performed.
We can test this with a linear model that includes a Plate interaction term and inspect it with summary().
# lm for standards with Plate interaction term
# you may do this for N1, N2, or both combined
summary(lm(Value ~ log10(rConc) * Plate, data = filter(std.ctls, AssayName == "N1")))
##
## Call:
## lm(formula = Value ~ log10(rConc) * Plate, data = filter(std.ctls,
## AssayName == "N1"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.07253 -0.15805 0.08585 0.47959 0.93936
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.1092 0.3852 52.20 <2e-16 ***
## log10(rConc) -2.6807 0.1366 -19.62 <2e-16 ***
## PlatePlate2 7.6200 0.5448 13.99 <2e-16 ***
## log10(rConc):PlatePlate2 -2.7147 0.1932 -14.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5796 on 50 degrees of freedom
## Multiple R-squared: 0.975, Adjusted R-squared: 0.9735
## F-statistic: 649.5 on 3 and 50 DF, p-value: < 2.2e-16
summary(lm(Value ~ log10(rConc) * Plate, data = filter(std.ctls, AssayName == "N2")))
##
## Call:
## lm(formula = Value ~ log10(rConc) * Plate, data = filter(std.ctls,
## AssayName == "N2"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1103 -0.2844 0.1153 0.5078 1.0081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.4752 0.4083 50.15 <2e-16 ***
## log10(rConc) -2.6758 0.1448 -18.48 <2e-16 ***
## PlatePlate2 8.1692 0.5774 14.15 <2e-16 ***
## log10(rConc):PlatePlate2 -2.8426 0.2048 -13.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6143 on 50 degrees of freedom
## Multiple R-squared: 0.973, Adjusted R-squared: 0.9714
## F-statistic: 601 on 3 and 50 DF, p-value: < 2.2e-16
summary(lm(Value ~ log10(rConc) * Plate, data = std.ctls.n1.n2))
##
## Call:
## lm(formula = Value ~ log10(rConc) * Plate, data = std.ctls.n1.n2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.36419 -0.30069 0.01769 0.33912 1.36122
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.2922 0.2999 67.67 <2e-16 ***
## log10(rConc) -2.6783 0.1063 -25.19 <2e-16 ***
## PlatePlate2 7.8946 0.4241 18.62 <2e-16 ***
## log10(rConc):PlatePlate2 -2.7787 0.1504 -18.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.638 on 104 degrees of freedom
## Multiple R-squared: 0.9692, Adjusted R-squared: 0.9684
## F-statistic: 1093 on 3 and 104 DF, p-value: < 2.2e-16
# Your answer here
Yes, the difference in slopes is -2.7 or -2.8 (depending on whether one uses N1,
N2, or both in the model).
The P-values for the interaction terms are very small.
If our standard dilutions are ok, we would like the amount of variation explained by the model to be high. We will use a cutoff of \(R^2 > 0.975\) as a QC measure.
Also, for a good PCR amplification, the amount of material should double in each cycle. We can check to see if our standards follow this pattern by computing the efficiency of amplification, which is given by this equation:
\[Conc \propto (1 + Eff)^{-C_t}\]
In practice, it is computed this way:
\[Eff = 10^{-1/slope} - 1\]
where the slope is derived from a linear regression of Cycles to Threshold (Ct) values plotted against the \(log_{10}\) values of the template amounts. A slope of -3.32 indicates an amplification efficiency of 100%.
Efficiency between 90% and 110% is generally considered to be acceptable. This corresponds to a slope between approximately -3.1 and -3.6.
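The efficiency formula can be wrapped in a small helper for quick sanity checks (a sketch; the function name slope_to_eff is ours):

```r
# Convert a standard-curve slope (Ct vs. log10 concentration) into
# amplification efficiency: Eff = 10^(-1/slope) - 1
slope_to_eff = function(slope) 10^(-1 / slope) - 1

slope_to_eff(-3.32)   # ~1.00, i.e. 100% efficiency
slope_to_eff(-3.6)    # ~0.90, lower bound of the acceptable window
slope_to_eff(-3.1)    # ~1.10, upper bound of the acceptable window
```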
Below, you will try to figure out which dilution series looks ok and which one does not.
You can extract the R-squared values with summary(your.model)$r.squared.
Note: you may choose to put these into a tabular format to make them easier to compare (but this is not needed to answer the question).
# make 4 separate linear models for N1 and N2 on Plate 1 or Plate 2
lm.n1.p1 = lm(Value ~ log10(rConc), data = filter(std.ctls, Plate == "Plate1" & AssayName == "N1"))
lm.n2.p1 = lm(Value ~ log10(rConc), data = filter(std.ctls, Plate == "Plate1" & AssayName == "N2"))
lm.n1.p2 = lm(Value ~ log10(rConc), data = filter(std.ctls, Plate == "Plate2" & AssayName == "N1"))
lm.n2.p2 = lm(Value ~ log10(rConc), data = filter(std.ctls, Plate == "Plate2" & AssayName == "N2"))
# look at the summaries of the models
summary(lm.n1.p1)
##
## Call:
## lm(formula = Value ~ log10(rConc), data = filter(std.ctls, Plate ==
## "Plate1" & AssayName == "N1"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40570 -0.10434 0.00655 0.15202 0.19658
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.10916 0.10929 184.00 <2e-16 ***
## log10(rConc) -2.68073 0.03876 -69.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1644 on 25 degrees of freedom
## Multiple R-squared: 0.9948, Adjusted R-squared: 0.9946
## F-statistic: 4784 on 1 and 25 DF, p-value: < 2.2e-16
summary(lm.n2.p1)
##
## Call:
## lm(formula = Value ~ log10(rConc), data = filter(std.ctls, Plate ==
## "Plate1" & AssayName == "N2"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54804 -0.13911 0.05744 0.14061 0.46089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.47516 0.17758 115.30 <2e-16 ***
## log10(rConc) -2.67578 0.06298 -42.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2672 on 25 degrees of freedom
## Multiple R-squared: 0.9863, Adjusted R-squared: 0.9858
## F-statistic: 1805 on 1 and 25 DF, p-value: < 2.2e-16
summary(lm.n1.p2)
##
## Call:
## lm(formula = Value ~ log10(rConc), data = filter(std.ctls, Plate ==
## "Plate2" & AssayName == "N1"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0725 -1.0247 0.4845 0.5871 0.9394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.7292 0.5337 51.96 <2e-16 ***
## log10(rConc) -5.3955 0.1893 -28.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.803 on 25 degrees of freedom
## Multiple R-squared: 0.9702, Adjusted R-squared: 0.969
## F-statistic: 812.7 on 1 and 25 DF, p-value: < 2.2e-16
summary(lm.n2.p2)
##
## Call:
## lm(formula = Value ~ log10(rConc), data = filter(std.ctls, Plate ==
## "Plate2" & AssayName == "N2"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1103 -1.0903 0.5138 0.5625 1.0081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.6444 0.5494 52.14 <2e-16 ***
## log10(rConc) -5.5184 0.1948 -28.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8267 on 25 degrees of freedom
## Multiple R-squared: 0.9698, Adjusted R-squared: 0.9686
## F-statistic: 802.1 on 1 and 25 DF, p-value: < 2.2e-16
# Look at the slope/efficiency and R^2
# Here is an outline for a data frame containing these data, if you want to use it
# (but you don't have to)
std.ctl.qc = data.frame(Model = c("N1-P1","N2-P1","N1-P2","N2-P2"),
Rsquared = c(summary(lm.n1.p1)$r.squared,
summary(lm.n2.p1)$r.squared,
summary(lm.n1.p2)$r.squared,
summary(lm.n2.p2)$r.squared),
Slope = c(summary(lm.n1.p1)$coef[2],
summary(lm.n2.p1)$coef[2],
summary(lm.n1.p2)$coef[2],
summary(lm.n2.p2)$coef[2]),
Eff = c(10^{-1/summary(lm.n1.p1)$coef[2]} - 1,
10^{-1/summary(lm.n2.p1)$coef[2]} - 1,
10^{-1/summary(lm.n1.p2)$coef[2]} - 1,
10^{-1/summary(lm.n2.p2)$coef[2]} - 1))
std.ctl.qc
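To make the pass/fail calls explicit, the cutoffs can also be applied to a table of these values. This is a self-contained base-R sketch using the rounded coefficients from the model summaries printed above:

```r
# Tabulate the QC checks using the slopes and R^2 values from the
# four per-plate model summaries above
qc = data.frame(
  Model    = c("N1-P1", "N2-P1", "N1-P2", "N2-P2"),
  Rsquared = c(0.9948, 0.9863, 0.9702, 0.9698),
  Slope    = c(-2.681, -2.676, -5.396, -5.518)
)
qc$Eff    = 10^(-1 / qc$Slope) - 1             # amplification efficiency
qc$R2_ok  = qc$Rsquared > 0.975                # R^2 cutoff
qc$Eff_ok = qc$Eff >= 0.90 & qc$Eff <= 1.10    # 90-110% efficiency window
qc
```

Only the Plate 1 models clear the R^2 cutoff; none of the four fall inside the 90-110% efficiency window.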
These fits don’t look perfect, but one plate looks a lot better than the other. So, you will need to choose one of these to use for quantifying your viral samples.
# Your answer here
We will use the standards from Plate 1:
- R-squared is above 0.975
- The slope/efficiency is not in the ideal range, but it's better than for Plate 2.
NOTE: We indeed verified later that there were some problems with a bunch of wells in one quadrant on one of the plates that were probably due to loading errors / contamination.
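Once a standard curve is chosen, a sample's viral load can be back-calculated from its Ct by inverting the regression. A minimal sketch using the Plate 1 N1 coefficients printed above (the function name ct_to_conc is ours):

```r
# Invert the standard curve Ct = b0 + b1 * log10(conc) to get
# conc = 10^((Ct - b0) / b1), in copies/ul
b0 = 20.109   # intercept of the Plate 1 N1 fit (from the summary above)
b1 = -2.681   # slope of the Plate 1 N1 fit
ct_to_conc = function(ct) 10^((ct - b0) / b1)

# round-trip sanity check: the Ct predicted for 500 copies/ul maps back to 500
ct_to_conc(b0 + b1 * log10(500))  # 500
```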
Now we need to check on the N1/N2/RP Assays for the clinical samples.
# distribution of Ct values for RP genes in clinical samples
ggplot(filter(samples, AssayName == "RP"),
aes(x=PHD_Class, y=Value, fill=PHD_Class)) +
geom_violin() +
geom_boxplot(width=0.05, fill="white") +
facet_wrap(~as.factor(Plate), nrow=1) +
scale_fill_manual(values=c("#00BFC4","#F8766D")) +
labs(title="Clincal samples: Amplification of host RP gene",
x="Sample Group",
y = "Ct for human RP gene")
# Your answer here
If a sample is good, then the human control gene (RP) should at least be
detectable. If not, then we would have to take another sample, because a
negative result would be uninformative.
Some of the samples on Plate 2 look invalid (Ct = 999).
Since some of the samples don’t look great, we will filter them before proceeding to the quantification.
Flag the samples that fail this QC in the samples dataframe (set NYUAD_Class to “Inconclusive”).
Note: Since this is tedious, we have filled in this part for you.
# ============================================================================ #
# subset valid samples
valid.sample.ids = samples %>%
group_by(SampleName) %>%
filter(AssayName == "RP" & Call == "Pass") %>%
summarize( TotalPass = sum(Value < 16) ) %>%
filter(TotalPass > 3)
valid.samples = merge(samples, valid.sample.ids, by = "SampleName") %>%
arrange(ID) %>% select(-TotalPass)
# ============================================================================ #
# number of valid samples
nrow(valid.sample.ids)
## [1] 171
# ============================================================================ #
# Flag inconclusive samples
samples$NYUAD_Class[ ! samples$SampleName %in% valid.samples$SampleName ] = "Inconclusive"
You should now have a filtered list with a full set of assays (N1, N2, and RP) for samples that you deem to be of “good quality”.
# Your answer here
171 samples are left.
# get just the RP gene assays
samples.rp = filter(valid.samples, AssayName == "RP")
# violin plot of RP Ct values in filtered data
# all samples
ggplot(filter(valid.samples, AssayName == "RP"),
aes(x=AssayName, y=Value, fill=AssayName)) +
geom_violin() +
geom_boxplot(width=0.05, fill="white") +
facet_wrap(~as.factor(Plate), nrow=1) +
scale_fill_manual(values=c("#00BFC4","#F8766D")) +
labs(title="Clincal samples: Amplification of host RP gene",
x="Probe set",
y = "Ct for human RP gene")
# by PHD diagnosis
ggplot(samples.rp, aes(x=PHD_Class, y=Value, fill=PHD_Class)) +
geom_violin() +
geom_boxplot(width=0.05, fill="white") +
facet_wrap(~as.factor(Plate), nrow=1) +
scale_fill_manual(values=c("#00BFC4","#F8766D")) +
labs(title="Clincal samples: Amplification of host RP gene",
x="Sample Group",
y = "Ct for human RP gene")
If we want to use standards from only one of the plates for quantification, we should probably make sure that the samples amplified similarly on both plates, even though it looks like something went wrong with the standards on one of the plates.
## QQ plots
# Plate 1
rp.p1 = samples.rp %>% filter(Plate == "Plate1") %>% select(Value)
qqnorm(rp.p1$Value)
qqline(rp.p1$Value)
# Plate 2
rp.p2 = samples.rp %>% filter(Plate == "Plate2") %>% select(Value)
qqnorm(rp.p2$Value)
qqline(rp.p2$Value)
## Test for normality
shapiro.test(rp.p1$Value)
##
## Shapiro-Wilk normality test
##
## data: rp.p1$Value
## W = 0.98953, p-value = 0.0006189
shapiro.test(rp.p2$Value)
##
## Shapiro-Wilk normality test
##
## data: rp.p2$Value
## W = 0.97711, p-value = 7.569e-07
## Test for difference in sample distributions
#t.test(rp.p1$Value,rp.p2$Value) # not valid since data don't look normal
# tests for non-normal distributions
ks.test(rp.p1$Value,rp.p2$Value)
##
## Two-sample Kolmogorov-Smirnov test
##
## data: rp.p1$Value and rp.p2$Value
## D = 0.082807, p-value = 0.06021
## alternative hypothesis: two-sided
wilcox.test(rp.p1$Value,rp.p2$Value)
##
## Wilcoxon rank sum test with continuity correction
##
## data: rp.p1$Value and rp.p2$Value
## W = 122634, p-value = 0.07593
## alternative hypothesis: true location shift is not equal to 0
# Your answer here
Both KS and Wilcoxon rank sum test give P-values close to 0.05, but they are a
little above it, so we will go ahead and use the standards from Plate 1 (we
don't really have a good alternative choice anyway...)